The purpose of this vignette is to explore the relationship between risk and other characteristics of World Bank projects.
Risk is somewhat related to the region in which a project is being implemented. However, there are other, stronger predictors of the overall risk rating, such as: the tenure of the project's TTL (task team leader; experienced ones are more likely to be trusted with high-risk projects), the scale of the project (as represented by the amounts committed), and the year when the project was approved (as a proxy for world events, changes in the Bank's risk tolerance, etc.).
The strongest regional relationship is between East Asia Pacific and macroeconomic and political risks. Both correlations are negative (-0.23 and -0.27), so these risks tend to be rated lower in that region. The visualizations can be found in the analysis below.
Overall risk rating is not a linear function of the other, specific types of risk. Institutional capacity is the type of risk that contributes the most to the overall risk rating according to two distinct measures.
The fragility and conflict indicator is negatively related to the net value of World Bank loans. It is hard to provide loans when it is uncertain whether the borrowers will still be in place once a situation is resolved.
The following packages are used:
library(exploratory)
library(janitor)
library(lubridate)
library(hms)
library(tidyr)
library(stringr)
library(readr)
library(forcats)
library(RcppRoll)
library(dplyr)
library(tibble)
library(rio)
library(plotly)
library(reshape2)
library(alluvial)
library(caret)
The provided datasets — project_data and risk_data — come in wide and long formats, respectively.
We bring these datasets together in a tidy format, where each column is a variable, and each row is a unique observation. In this case, the unique identifier is project_id.
# Recoding `risk_rating` in place as a number to facilitate computation in R.
# This can be Low (1), Moderate (2), Substantial (3), or High (4); any other code becomes 0.
risk_data <- risk_data %>%
  mutate(risk_rating =
           ifelse(risk_rating == "L", 1,
           ifelse(risk_rating == "M", 2,
           ifelse(risk_rating == "S", 3,
           ifelse(risk_rating == "H", 4, 0)))))
# Reshaping the data and cleaning up
risk_data_wide <- dcast(risk_data,
                        project_id + risk_rating_sequence ~ risk_rating_code,
                        value.var = "risk_rating",
                        fun.aggregate = mean)
# Joining the data and getting rid of variables that have no variation
joined_data <- project_data %>%
  inner_join(risk_data_wide, by = "project_id") %>%
  select(-scale_up, -len_instr_type)
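The recode, reshape, and join steps above can be illustrated on a toy example. The sketch below uses base R only and invented data: the risk codes `POL` and `MAC`, the region values, and the object names are placeholders, not the dataset's actual contents. Unlike the code above, it maps unrecognized rating letters to `NA` rather than 0.

```r
# Toy long-format ratings (hypothetical values, not the real risk_data)
risk_long <- data.frame(
  project_id = c("P1", "P1", "P2", "P2"),
  risk_rating_code = c("POL", "MAC", "POL", "MAC"),
  risk_rating = c("L", "H", "S", "M"),
  stringsAsFactors = FALSE
)

# Recode letters to numbers with a named lookup vector; unknown letters become NA
rating_levels <- c(L = 1, M = 2, S = 3, H = 4)
risk_long$risk_rating <- unname(rating_levels[risk_long$risk_rating])

# Long -> wide: one row per project, one column per risk type (mean of duplicates)
wide_mat <- tapply(risk_long$risk_rating,
                   list(risk_long$project_id, risk_long$risk_rating_code),
                   mean)
risk_wide <- data.frame(project_id = rownames(wide_mat), wide_mat,
                        row.names = NULL, stringsAsFactors = FALSE)

# Toy project table (hypothetical); P3 has no ratings and is dropped by the inner join
projects <- data.frame(
  project_id = c("P1", "P2", "P3"),
  region = c("AFR", "EAP", "ECA"),
  stringsAsFactors = FALSE
)
joined <- merge(projects, risk_wide, by = "project_id")
print(joined)
```

The result has one row per project that appears in both tables, with one numeric column per risk type.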
Disclaimer: The analysis below was performed on the entire universe of the data. Risks such as political, governance, and macroeconomic risk may change abruptly once a new administration is in place. Therefore, the insights produced should be treated as generalizations from historical data.
Upon completing exploratory data analysis, we concluded that an alluvial plot is the most suitable way to visualize the many relationships between overall risk and regions.
Assumption 1: All risk evaluations are performed by staff, subjectively. The Overall risk category is a qualitative assessment of the risk, which may or may not be a distinct function of other types of risk.
Assumption 2: When making a decision on a project, Overall risk is the key factor. We will focus on it first, proceeding to study subcategories of risk later.
# Prepare data for visualization
joined_data_freq <- joined_data %>%
  group_by(risk_overall, region, fcs_indicator, proj_emrg_recvry_flg) %>%
  summarise(freq = n()) %>%
  filter(region != "OTH") %>%
  arrange(desc(region))
# Create an alluvial chart
alluvial(joined_data_freq[, 1:4],
         freq = joined_data_freq$freq,
         border = NA,
         hide = joined_data_freq$freq < quantile(joined_data_freq$freq, 0.5),
         col = ifelse(joined_data_freq$risk_overall == "4", "red",
               ifelse(joined_data_freq$risk_overall == "3", "orange",
               ifelse(joined_data_freq$risk_overall == "2", "cyan", "blue"))))
Legend: red = High (4), orange = Substantial (3), cyan = Moderate (2), blue = Low (1).
The above figure shows in color how different levels of risk are distributed across regions, including those facing fragile, conflict, or emergency situations. These situations have been chosen to accompany regions because they are intrinsically related to specific geographies.
Although it presents a lot of information in a compact format, staff familiar with the dataset will be able to read it easily. For instance, it can be seen that:
The most frequent rating is Substantial (3). The second most frequent rating is Moderate. This shows that most risk assessors refrain from making extreme judgements.
Substantial (3) and High (4) risk projects are spread across the regions in a way close to a normal distribution. A percentage breakdown of overall risk by region can be found below.
[...] Substantial (3) or High (4) risk.
# TODO fix the table
The takeaway of this table is that proportions of projects of each risk category are approximately the same across all regions. This is a piece of evidence showing that risk is not substantially or exclusively related to the region in which a project is being implemented.
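A percentage breakdown of this kind can be computed along the following lines. This is a minimal sketch on invented data: the `toy` data frame and its values are hypothetical and do not reproduce the actual table.

```r
# Hypothetical projects (invented values for illustration only)
toy <- data.frame(
  region = c("AFR", "AFR", "AFR", "AFR", "EAP", "EAP", "EAP", "EAP"),
  risk_overall = c(2, 3, 3, 4, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Counts of projects per region (rows) and overall risk level (columns)
counts <- table(region = toy$region, risk_overall = toy$risk_overall)

# Row-wise percentages: each region's row sums to 100
pct_by_region <- round(100 * prop.table(counts, margin = 1), 1)
print(pct_by_region)
```

Normalizing within each region (`margin = 1`) is what makes the rows comparable even when regions have very different numbers of projects.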
To better understand the relationship between risk and region, we will look for more patterns in the data.
Let’s create a correlation matrix that will help us understand which variables are similar based on how the underlying data varies.
# Performing one-hot encoding of the region and other variables. The purpose is to transform each value of each categorical feature into a binary feature {0, 1}
j_data_corr <- as.data.frame(joined_data)
j_data_corr$region <- as.factor(j_data_corr$region)
for (level in unique(j_data_corr$region)) {
  j_data_corr[paste("reg", level, sep = "_")] <- ifelse(j_data_corr$region == level, 1, 0)
}
j_data_corr$fcs_indicator <- ifelse(j_data_corr$fcs_indicator == "Y", 1, 0)
j_data_corr$proj_emrg_recvry_flg <- ifelse(j_data_corr$proj_emrg_recvry_flg == "Y", 1, 0)
j_data_corr <- j_data_corr %>% select(-tl)
# Running Spearman correlation analysis
var_corrs <- j_data_corr %>%
  do_cor(which(sapply(., is.numeric)),
         use = "pairwise.complete.obs",
         method = "spearman",
         distinct = FALSE,
         diag = TRUE)
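For readers without the `exploratory` package, the same idea (one-hot encoding followed by a Spearman correlation over numeric columns) can be sketched in base R. This is an approximation under stated assumptions, not a reproduction of `do_cor`: the `toy` data frame and its values are invented, and the result is a plain correlation matrix rather than the data frame that `do_cor` returns.

```r
# Toy data (hypothetical values standing in for joined_data)
toy <- data.frame(
  region = c("AFR", "EAP", "AFR", "ECA", "EAP", "AFR"),
  risk_overall = c(4, 2, 3, 3, 1, NA),
  net_commit_amt = c(10, 250, 40, 90, 300, 20),
  stringsAsFactors = FALSE
)

# One-hot encode region: one 0/1 column per level, no intercept column
one_hot <- model.matrix(~ region - 1, data = toy)
colnames(one_hot) <- sub("^region", "reg_", colnames(one_hot))

# Combine the indicator columns with the numeric variables
numeric_data <- cbind(one_hot, toy[, c("risk_overall", "net_commit_amt")])

# Spearman rank correlation; each pair of variables uses the rows where both are present
corr_matrix <- cor(numeric_data, method = "spearman", use = "pairwise.complete.obs")
print(round(corr_matrix, 2))
```

Spearman correlation works on ranks, which suits ordinal ratings such as 1 to 4, and `pairwise.complete.obs` avoids discarding a whole row because one variable is missing.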
There are very few strong positive correlations in the above plot. The strongest of them involve the indicator of a project being in a fragile or conflict situation.
The 0.25 correlation between fcs_indicator and risk_overall shows that fragile locations are somewhat associated with higher risk. The relationship is positive, but weak.
The 0.25 correlation between Africa and Political and Governance risk hints at instability in the region, especially relative to the -0.28 correlation with political risk in East Asia Pacific. A similar situation is observed for macroeconomic risk in these regions. These are the strongest relationships between risk and region.
Other regions do not show as strong a relationship with different types of risk.
We also observe a few strongly negative correlations:
The -0.29 correlation between fcs_indicator and net_commit_amt, the net value of World Bank loans. It is hard to provide loans when it is uncertain whether the borrowers will still be in place once a situation is resolved.
The -0.95 correlation between approval_fy and risk_rating_sequence is trivial: more recent projects have had fewer opportunities to be assessed later in their lifecycle.
The -0.62 correlation between net_commit_amt and grant is straightforward as well: the Bank is more likely to provide one type of aid or the other, either a loan or a grant. However, some exceptions apply.
With these preliminary results in hand, we proceed to a more advanced analysis.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 0 4 0 0
## 2 5 52 47 5
## 3 6 71 140 18
## 4 0 1 12 17
##
## Overall Statistics
##
## Accuracy : 0.5529
## 95% CI : (0.5012, 0.6038)
## No Information Rate : 0.5265
## P-Value [Acc > NIR] : 0.1639
##
## Kappa : 0.2106
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.00000 0.4062 0.7035 0.42500
## Specificity 0.98910 0.7720 0.4693 0.96154
## Pos Pred Value 0.00000 0.4771 0.5957 0.56667
## Neg Pred Value 0.97059 0.7175 0.5874 0.93391
## Prevalence 0.02910 0.3386 0.5265 0.10582
## Detection Rate 0.00000 0.1376 0.3704 0.04497
## Detection Prevalence 0.01058 0.2884 0.6217 0.07937
## Balanced Accuracy 0.49455 0.5891 0.5864 0.69327
We have built a model of overall risk, and the above results pertain to the 30% of the data that was held out to test it.
The results show that XGBoost reaches a balanced accuracy of 0.69 for High risk. For risk levels 2 and 3 the balanced accuracy is about 0.59, better than random assignment, while for risk level 1 it is around 0.49. This is a good result, given that we are most concerned with higher levels of risk. In this scenario it is more valuable to predict high risks accurately and potentially raise false alarms than to miss a high risk completely. Moreover, there are only 11 cases of risk level 1 in the test set, which makes that class hard for the algorithm to learn.
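The per-class figures quoted here are the "Balanced Accuracy" row of the output above, that is, the mean of sensitivity and specificity for each class. As a check, the sketch below recomputes them in base R from the confusion matrix printed above; the variable names are our own.

```r
# Confusion matrix copied from the output above: rows = predictions, columns = reference
cm <- matrix(
  c(0,  4,   0,  0,
    5, 52,  47,  5,
    6, 71, 140, 18,
    0,  1,  12, 17),
  nrow = 4, byrow = TRUE,
  dimnames = list(prediction = c("1", "2", "3", "4"),
                  reference = c("1", "2", "3", "4"))
)

true_pos  <- diag(cm)
false_neg <- colSums(cm) - true_pos  # cases of the class that the model missed
false_pos <- rowSums(cm) - true_pos  # cases wrongly assigned to the class
true_neg  <- sum(cm) - true_pos - false_neg - false_pos

sensitivity <- true_pos / (true_pos + false_neg)
specificity <- true_neg / (true_neg + false_pos)
balanced_accuracy <- (sensitivity + specificity) / 2
overall_accuracy <- sum(true_pos) / sum(cm)

print(round(balanced_accuracy, 4))
print(round(overall_accuracy, 4))
```

This reproduces the reported values: balanced accuracy of 0.4946, 0.5891, 0.5864, and 0.6933 for classes 1 to 4, and an overall accuracy of 0.5529.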
#m_xgb_coef$importance <- format(m_xgb_coef$importance, scientific=F)
options("scipen"=100, "digits"=4)
p <- m_xgb_coef %>%
  filter(importance > 0.00683778) %>%
  ggplot(aes(x = reorder(feature, importance),
             y = importance,
             fill = importance)) +
  coord_flip() +
  scale_fill_gradient(name = "Variable\nimportance", low = "gray", high = "red") +
  geom_bar(stat = "identity") +
  theme_bw() +
  xlab("Variable names") +
  ylab("")
ggplotly(p, tooltip = c("y"))
According to the plot above, the most important features for predicting risk in this dataset fall into two clusters:
High importance:
Low importance:
To answer our question, some regions are better predictors of risk than others. The modelling exercise confirms the above considerations that East Asia Pacific (EAP) is a stronger predictor than some other regions. However, a new piece of information is that ECA is also noticeable. All of them, still, are in the low importance cluster.
In Part 1 we focused on the overall risk measure and its relation to region. We are tasked with exploring the association of risk and other measures. However, this has already been covered in the visualizations and analyses in Part 1. The results are valuable, yet there is an even more insightful discovery below.
For this section, we have prepared an analysis of how different types of risk influence the measure of overall risk. Thus, we use only the types of risk to predict the overall risk rating. Such a model cannot be used in applied settings, because it constitutes data leakage. In this case the leakage is intentional: the goal is to show that the contributions of different types of risk are uneven.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 2 6 0 0
## 2 9 86 16 0
## 3 0 36 172 17
## 4 0 0 11 23
##
## Overall Statistics
##
## Accuracy : 0.749
## 95% CI : (0.702, 0.792)
## No Information Rate : 0.526
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.564
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.18182 0.672 0.864 0.5750
## Specificity 0.98365 0.900 0.704 0.9675
## Pos Pred Value 0.25000 0.775 0.764 0.6765
## Neg Pred Value 0.97568 0.843 0.824 0.9506
## Prevalence 0.02910 0.339 0.526 0.1058
## Detection Rate 0.00529 0.228 0.455 0.0608
## Detection Prevalence 0.02116 0.294 0.595 0.0899
## Balanced Accuracy 0.58273 0.786 0.784 0.7712
Here we observe that the #1 contributor to overall risk is institutional capacity.
We run the correlation analysis again to magnify the relationships between types of risk.
We indeed see that institutional capacity is the risk most strongly correlated with overall risk, at 0.63.
It is interesting that while fiduciary risk is also fairly strongly correlated with overall risk, at 0.52, it is located in the low-value cluster in the XGBoost variable importance analysis above.
According to that analysis, the second and third most important contributors are political and stakeholder risks. Along with fiduciary risk, technical design, environment, sector strategies, and macroeconomic risks are less important for overall risk.
It is logical to see a positive association of 0.51 between fiduciary risk and institutional capacity. There is an association between in-country trustees and strong institutions, and vice versa.
Interestingly, environment risks have a near-zero association with macroeconomic threats. One would expect macroeconomic activity to threaten the environment, because the means used to support it are, for the most part, not eco-friendly.